This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

Support heterogeneous environment service #3097

Merged: 54 commits merged into microsoft:master on Dec 15, 2020

Conversation

SparkSnail
Contributor

@SparkSnail SparkSnail commented Nov 17, 2020

  • Support remote
  • Support azureml
  • Support local, 11.25
  • Support PAI 11.25
  • doc 11.25
  • Refactor code, fix bugs and pr ready for review 11.26

SparkSnail added 22 commits May 29, 2020 17:02
@SparkSnail SparkSnail mentioned this pull request Nov 17, 2020
77 tasks
ts/nni_manager/main.ts (outdated)
});
await Promise.all(tasks);
Member

Check whether the AML client SDK is able to do this in parallel.

Contributor Author

Reverted the AML change here.
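For readers unfamiliar with the pattern in the quoted snippet, here is a minimal sketch of what running the tasks "in parallel" with `Promise.all` looks like: every call is started before any of them is awaited. The `refreshOne` helper and the environment ids are hypothetical stand-ins, not AML SDK calls, and whether the AML client supports concurrent calls is exactly the open question raised above.

```typescript
// Hypothetical stand-in for a per-environment SDK call; not part of any real SDK.
async function refreshOne(envId: string, log: string[]): Promise<string> {
    log.push(`start ${envId}`);
    // Yield once to simulate asynchronous work.
    await Promise.resolve();
    log.push(`done ${envId}`);
    return `${envId}:ok`;
}

// Start every call first, then wait for the whole batch.
// Promise.all keeps the results in the same order as the input.
async function refreshAll(envIds: string[], log: string[]): Promise<string[]> {
    const tasks: Promise<string>[] = envIds.map((id) => refreshOne(id, log));
    return Promise.all(tasks);
}
```

If any single task rejects, `Promise.all` rejects as a whole, so a caller that needs per-task error handling would have to catch inside each task.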

@@ -96,7 +100,7 @@ export class OpenPaiEnvironmentService extends EnvironmentService {
}

const getJobInfoRequest: request.Options = {
- uri: `${this.protocol}://${this.paiClusterConfig.host}/rest-server/api/v2/jobs?username=${this.paiClusterConfig.userName}`,
+ uri: `${this.protocol}://${this.paiClusterConfig.host}/rest-server/api/v2/jobs/${this.paiClusterConfig.userName}~${environment.envId}`,
Member

This change causes a significant performance issue and may hit the OpenPAI threshold that blocks API calls.

Contributor Author

Reverted to use the jobs API.
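The performance concern can be illustrated with a small sketch: one request to the jobs list endpoint per polling cycle, followed by local matching, instead of one request per environment. The `JobSummary` shape, the injected `Fetcher`, and the assumption that a job's `name` equals the environment id are all illustrative and not taken from the PR.

```typescript
// Assumed shape of one entry in the job-list response (illustrative only).
interface JobSummary {
    name: string;
    state: string;
}

// Hypothetical HTTP getter, injected so the sketch runs without a network.
type Fetcher = (uri: string) => JobSummary[];

// Issue a single list request, then look up each environment locally.
function refreshStatuses(
    baseUri: string,
    userName: string,
    envIds: string[],
    fetchJobs: Fetcher
): Map<string, string> {
    const jobs = fetchJobs(`${baseUri}/rest-server/api/v2/jobs?username=${userName}`);
    const stateByName = new Map<string, string>();
    for (const job of jobs) {
        stateByName.set(job.name, job.state);
    }
    const result = new Map<string, string>();
    for (const envId of envIds) {
        // Assumption: the job name matches the environment id.
        result.set(envId, stateByName.get(envId) ?? "UNKNOWN");
    }
    return result;
}
```

With this layout the number of API calls per cycle stays at one regardless of how many environments are running, which is the property the reviewer was protecting.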

import { GpuScheduler } from './gpuScheduler';
import { MountedStorageService } from './storages/mountedStorageService';
import { StorageService } from './storageService';
import { TrialDetail } from './trial';
import { AMLEnvironmentService } from './environments/amlEnvironmentService';
Member

These should be loaded from config rather than hard-coding every subclass here.

Contributor Author

I didn't get the point. How would they be loaded from config?


private readonly trials: Map<string, TrialDetail>;
private readonly environments: Map<string, EnvironmentInformation>;
// make public for ut
public environmentServiceList: EnvironmentService[] = [];
public commandChannelDict: Map<Channel, CommandChannel>;
Member

It should live in environmentService itself, since every use of this map is related to the environment service. As explained in the comments above, in some cases it may not be a single instance. Maintaining such a map at this level is unnecessary and is not good practice either.

Contributor Author

updated.

@@ -62,6 +71,8 @@ class TrialDispatcher implements TrainingService {
private enableGpuScheduler: boolean = false;
// uses to save if user like to reuse environment
private reuseEnvironment: boolean = true;
private logCollection: string = '';
private environmentMaintenceLoopInterval: number = 5000;
Member

It's better to read it from all services and take the maximum. That can improve performance on local and other lightweight services.

Contributor Author

updated.
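A minimal sketch of the suggestion, assuming each service exposes its own interval in milliseconds: the dispatcher takes the maximum across the configured services instead of a fixed 5000. The `IntervalSource` interface and `pickLoopInterval` helper are hypothetical; only the field name and the 5000 default come from the quoted diff.

```typescript
// Minimal shape assumed for this sketch; the real service class has many more members.
interface IntervalSource {
    // Milliseconds. The spelling follows the field quoted in the diff.
    environmentMaintenceLoopInterval: number;
}

// Take the largest interval any configured service asks for,
// falling back to a default when no service is configured.
function pickLoopInterval(services: IntervalSource[], fallbackMs: number = 5000): number {
    if (services.length === 0) {
        return fallbackMs;
    }
    return Math.max(...services.map((s) => s.environmentMaintenceLoopInterval));
}
```

Under this assumption an experiment that only uses a lightweight service can loop faster than the fixed default, while a mixed experiment still respects the slowest service's limit.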

@@ -122,13 +132,7 @@ class TrialDispatcher implements TrainingService {

const trialId: string = uniqueString(5);

const environmentService = component.get<EnvironmentService>(EnvironmentService);
let trialWorkingFolder: string = "";
if (environmentService.hasStorageService) {
Member

Where is this logic now? It may break OpenPAI with shared folders.

Contributor Author

@SparkSnail SparkSnail Dec 11, 2020

line 705 in trialDispatcher.ts.

for(const platform of platforms) {
let environmentService: EnvironmentService;
switch(platform) {
case 'local':
Member

This switch is not necessary; the services should be created through a factory pattern.

Contributor Author

Updated: added an environmentServiceFactory class to generate the environmentService.
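As a rough sketch of the factory idea discussed in this thread, a lookup table keyed by platform name can replace the switch statement. The stub service classes, the `creators` table, and the method name below are assumptions for illustration; the PR's actual factory is not shown in this conversation.

```typescript
// Minimal stand-ins for this sketch; the real classes live in the NNI code base.
abstract class EnvironmentService {
    abstract readonly platform: string;
}

class LocalEnvironmentService extends EnvironmentService {
    readonly platform = "local";
}

class RemoteEnvironmentService extends EnvironmentService {
    readonly platform = "remote";
}

// Hypothetical factory: adding a platform means adding one table entry.
class EnvironmentServiceFactory {
    private static readonly creators = new Map<string, () => EnvironmentService>([
        ["local", () => new LocalEnvironmentService()],
        ["remote", () => new RemoteEnvironmentService()],
    ]);

    public static createEnvironmentService(platform: string): EnvironmentService {
        const create = EnvironmentServiceFactory.creators.get(platform);
        if (create === undefined) {
            // Same error text as the quoted default branch.
            throw new Error(`${platform} not supported!`);
        }
        return create();
    }
}
```

The dispatcher loop then reduces to one call per configured platform, and the table itself could be populated from configuration, which is one reading of the earlier "load from config" remark.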

throw new Error(`${platform} not supported!`);
}
if (!this.commandChannelDict.has(environmentService.getCommandChannelName)) {
switch(environmentService.getCommandChannelName) {
Member

This is not necessary; the channel should come from the service.

Contributor Author

updated.
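One way to read "the channel should come from the service" is sketched below: each service creates and caches its own command channel, so the dispatcher needs neither a channel map nor a switch on the channel name. All class names, the `send` method, and the pairing of services with channel types are assumptions made for illustration.

```typescript
// Simplified stand-in; the real command channel interface is richer.
interface CommandChannel {
    readonly channelName: string;
    send(command: string): string;
}

class WebCommandChannel implements CommandChannel {
    readonly channelName = "web";
    send(command: string): string {
        return `web:${command}`;
    }
}

class FileCommandChannel implements CommandChannel {
    readonly channelName = "file";
    send(command: string): string {
        return `file:${command}`;
    }
}

abstract class EnvironmentService {
    private channel: CommandChannel | undefined;

    // Each subclass decides which channel it needs.
    protected abstract createCommandChannel(): CommandChannel;

    // Created on first use and cached per service instance.
    public getCommandChannel(): CommandChannel {
        if (this.channel === undefined) {
            this.channel = this.createCommandChannel();
        }
        return this.channel;
    }
}

// Hypothetical services; the channel pairing here is arbitrary.
class DemoLocalService extends EnvironmentService {
    protected createCommandChannel(): CommandChannel {
        return new WebCommandChannel();
    }
}

class DemoSharedStorageService extends EnvironmentService {
    protected createCommandChannel(): CommandChannel {
        return new FileCommandChannel();
    }
}
```

Because the channel is cached on the service instance, two instances of the same service type each get their own channel, which fits the reviewer's point that a service may not be a single instance.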

for (let index = 0; index < number; index++) {
await this.requestEnvironment();
// Schedule a environment platform for environment
private selectEnvironmentService(): EnvironmentService | undefined {
Member

Cool. This can easily move into the GPU scheduler in the future.
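The conversation does not show how `selectEnvironmentService` chooses a platform. Purely as an illustration, a round-robin policy that skips services unable to take more environments could look like the following; the `hasMoreEnvironments` flag, the `SelectableService` interface, and the `RoundRobinSelector` class are assumptions, not the PR's implementation.

```typescript
// Minimal shape assumed for this sketch.
interface SelectableService {
    readonly name: string;
    hasMoreEnvironments: boolean;
}

class RoundRobinSelector {
    private readonly services: SelectableService[];
    private nextIndex: number = 0;

    constructor(services: SelectableService[]) {
        this.services = services;
    }

    // Return the next service that can still accept an environment,
    // or undefined when none can (matching the quoted return type).
    public selectEnvironmentService(): SelectableService | undefined {
        for (let offset = 0; offset < this.services.length; offset++) {
            const index = (this.nextIndex + offset) % this.services.length;
            const candidate = this.services[index];
            if (candidate.hasMoreEnvironments) {
                // Resume after the chosen service on the next call.
                this.nextIndex = (index + 1) % this.services.length;
                return candidate;
            }
        }
        return undefined;
    }
}
```

Keeping the policy behind a single method is what makes it straightforward to relocate later, for example into a scheduler that also weighs GPU availability.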

@SparkSnail SparkSnail closed this Dec 11, 2020
@SparkSnail SparkSnail reopened this Dec 11, 2020
@SparkSnail SparkSnail merged commit 872554f into microsoft:master Dec 15, 2020

4 participants